Math Review + Probability Theory

Chelsea Parlett-Pelleriti

Probability

Probability Spaces

\[ \left ( \Omega, \mathcal{B}, P \right ) \]

Probability Spaces

\[ \left ( \color{#E69F00}{\Omega}, \mathcal{B}, P \right ) \]

Sample Space: Set of outcomes

  • {1,2,3,4,5,6}
  • {H,T}

Probability Spaces

\[ \left ( \Omega, \color{#56B4E9}{\mathcal{B}}, P \right ) \]

Event Space: Set of events

  • even dice rolls: {2,4,6}
  • dice rolls > 4: {5,6}

Probability Spaces

\[ \left ( \Omega, \color{#56B4E9}{\mathcal{B}}, P \right ) \]

\(\mathcal{B}\) is modeled as a \(\sigma\)-algebra on \(\Omega\): a set of subsets of \(\Omega\) with the following properties:

  1. \(\emptyset \in \mathcal{B}\)
  2. \(A \in \mathcal{B} \to A^c \in \mathcal{B}\)
  3. \(A_1, A_2, ... \in \mathcal{B} \to \cup_{i=1}^{\infty}A_i \in \mathcal{B}\)

Probability Spaces

\[ \left ( \Omega, \color{#56B4E9}{\mathcal{B}}, P \right ) \]

There can be multiple \(\sigma\)-algebras for a single sample space. For example:

  1. Trivial: \(\left \{ \emptyset, \Omega \right\}\)
  2. Powerset: all subsets of \(\Omega\)

Probability Spaces

\[ \left ( \Omega, \mathcal{B}, \color{#009E73}{P} \right ) \]

\(P\) is a Probability Function that assigns a number (a probability; \(0\leq p \leq 1\)) to each event in \(\mathcal{B}\).

Probability Spaces

\[ \left ( \Omega, \mathcal{B}, \color{#009E73}{P} \right ) \]

The probability function must satisfy:

  1. \(P(A) \geq 0 \text{ for all } A \in \mathcal{B}\)
  2. \(P(\Omega) = 1\)
  3. If \(A_1, A_2,... \in \mathcal{B}\) are pairwise disjoint (\(A_i \cap A_j = \emptyset\) for \(i \neq j\)) \(\to\) \(P(\cup_{i=1}^{\infty}A_i) =\) \(\sum_{i=1}^{\infty} P(A_i)\)

Set Theory

  • \(A \cup B = B \cup A\)
  • \(\left ( A \cup B \right)^c = A^c \cap B^c\)
  • \(\left ( A \cap B \right)^c = A^c \cup B^c\)
  • \(\left (A \cup B \right ) \cup C = A \cup \left (B \cup C\right)\)
  • \(A \cap \left ( B \cup C \right) = \left( A \cap B \right ) \cup \left( A \cap C \right)\)
  • \(A \cup \left (B \cap C\right) = \left( A \cup B \right ) \cap \left( A \cup C \right )\)

Probability

Why was all that important? It allows us to know things like:

  • \(P(A^c) = 1- P(A)\)
  • \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\)
  • \(P(B \cap A^c) = P(B) - P(A \cap B)\)

which allow us to do all of statistics.
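These identities are easy to sanity-check by simulation. A minimal sketch, assuming a fair six-sided die and using the even-roll and greater-than-4 events from earlier:

```r
set.seed(1)
rolls <- sample(1:6, 1e5, replace = TRUE)  # simulate fair die rolls
A <- rolls %% 2 == 0   # even rolls: {2,4,6}
B <- rolls > 4         # rolls > 4: {5,6}

# inclusion-exclusion: P(A or B) = P(A) + P(B) - P(A and B)
mean(A | B)
mean(A) + mean(B) - mean(A & B)  # identical, since the identity holds roll-by-roll
```

Both lines print the same estimate of \(P(A \cup B)\) (the true value is \(4/6\)).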

Conditional Probability

For \(A,B \in \mathcal{B}\) and \(P(B) \neq 0\):

\[ P(A \mid B) = \frac{P(A\cap B)}{P(B)} \]

Conditioning on \(B\) means that \(B\) now plays the role of the sample space \(\Omega\) (it reduces the set of outcomes we’re looking at)

Independence

\[P(A \mid B) = P(A)\]

Learning that \(B\) happened does not give us any information about whether \(A\) happened, and therefore \(A\) and \(B\) are independent.

Bayes’ Rule

\[ P(A \mid B) = \frac{P(A\cap B)}{P(B)} = \frac{P(B \mid A) \cdot P(A)}{P(B)} \]

Bayes’ Rule is a way to calculate conditional probabilities

Bayes’ Rule

\[ P(\text{covid} \mid +) = \frac{P(+ \mid \text{covid}) \cdot P(\text{covid})}{P(+)} \]

What is the probability of having covid given that you got a positive covid test?

\[ P(\text{covid}) = 0.01 \\ \text{Sensitivity} = 0.65 \\ \text{Specificity} = 0.90 \]
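Working through these numbers (variable names below are just for illustration): the law of total probability gives \(P(+)\), then Bayes’ Rule gives the posterior.

```r
p_covid <- 0.01   # prior P(covid)
sens    <- 0.65   # sensitivity: P(+ | covid)
spec    <- 0.90   # specificity: P(- | no covid)

# total probability: P(+) = P(+|covid)P(covid) + P(+|no covid)P(no covid)
p_pos <- sens * p_covid + (1 - spec) * (1 - p_covid)

# Bayes' rule
p_covid_given_pos <- sens * p_covid / p_pos
round(p_covid_given_pos, 3)  # ~0.062
```

Even after a positive test, the probability of covid is only about 6%, because the disease is rare and false positives dominate.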

Bayes’ Rule

\[ P(\text{theory} \mid \text{data}) = \frac{P(\text{data} \mid \text{theory}) \cdot P(\text{theory})}{P(\text{data})} \]

“How did my theory change after seeing the data?”

Bayes’ Rule

\[ \color{#D55E00}{P(\text{theory} \mid \text{data})} = \frac{P(\text{data} \mid \text{theory}) \cdot P(\text{theory})}{P(\text{data})} \]

The Posterior is the probability of our theory after seeing the data.

Bayes’ Rule

\[ P(\text{theory} \mid \text{data}) = \frac{P(\text{data} \mid \text{theory}) \cdot \color{#CC79A7}{P(\text{theory})}}{P(\text{data})} \]

The Prior is the probability of our theory before seeing the data.

Bayes’ Rule

\[ P(\text{theory} \mid \text{data}) = \frac{\color{#F5C710}{P(\text{data} \mid \text{theory})} \cdot P(\text{theory})}{P(\text{data})} \]

The Likelihood is the probability of our data, assuming the theory is true (how well the theory fits the data).

Bayes’ Rule

\[ P(\text{theory} \mid \text{data}) = \frac{P(\text{data} \mid \text{theory}) \cdot P(\text{theory})}{\color{#009E73}{P(\text{data})}} \]

The Normalizing Constant is the probability of our data. It normalizes the posterior so that it’s a valid probability (distribution).

It does not depend on the theory, so when comparing theories it often does not matter.

Probability and Likelihood

\(P(x \mid \theta)\) and \(L(\theta \mid x)\) are the same quantity read two ways: fix \(\theta\) and treat it as a function of the data \(x\) (a probability), or fix the observed data \(x\) and treat it as a function of the parameter \(\theta\) (a likelihood).

Random Variables

Random Variables

Poll 100 people about whether they’ll vote for A (\(1\)) or not (\(0\)).

There are \(2^{100}\) ways for these 100 people to respond:

\[\begin{bmatrix} 0 & 1 & 0 & \ldots & 1 \\ 0 & 0 & 1 & \ldots & 1\\ 0 & 0 & 0 & \ldots & 1\\ 0 & 0 & 0 & \ldots & 1\\ \ldots & \ldots & \ldots & \ldots & \ldots\\ 0 & 0 & 0 & \ldots & 1 \end{bmatrix}\]

Random Variables

Poll 100 people about whether they’ll vote for A (\(1\)) or not (\(0\)).

There are \(2^{100}\) ways for these 100 people to respond: do you care about all \(2^{100}\) ways?

Random Variables

Poll 100 people about whether they’ll vote for A (\(1\)) or not (\(0\)).

No. Votes   No. Outcomes
0           1
1           100
2           4950
...         ...
100         1

The summary statistic: the number of votes for A.

Random Variables

Assume \(P(1) = P(0) = 0.5\)

No. Votes   No. Outcomes   p
0           1              \(\frac{1}{2^{100}}\)
1           100            \(\frac{100}{2^{100}}\)
2           4950           \(\frac{4950}{2^{100}}\)
...         ...            ...
100         1              \(\frac{1}{2^{100}}\)

What is \(P(\text{votes} \leq 40)\) ?
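With \(P(1) = 0.5\), the number of votes is Binomial(100, 0.5), so this can be answered either by summing the table’s probabilities directly or with R’s built-in binomial CDF:

```r
# choose(100, k) outcomes have exactly k votes, each with probability 1/2^100
p <- choose(100, 0:100) / 2^100

sum(p[1:41])                        # P(votes <= 40), summing k = 0..40
pbinom(40, size = 100, prob = 0.5)  # same answer via the binomial CDF
```

Both lines give the same (small) probability of seeing 40 or fewer votes.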

PMF + CDF

cumulative distribution function: \(F_{X}(x) = P_{X}(X \leq x) = \sum_{x_i \leq x} P(x_i)\)

PMF + CDF

probability mass function: \(f_{X}(x) = P(X = x)\)

PDF + CDF

cumulative distribution function: \(F_{X}(x) = P_{X}(X \leq x) = \int_{-\infty}^x f_X(t) \, dt\)

PDF + CDF

probability density function: \(f_{X}(x) = \frac{d}{dx} F_{X}(x)\)
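A quick numerical sanity check of this relationship for the standard normal, using base R’s pnorm (CDF) and dnorm (PDF):

```r
x <- 1.3
h <- 1e-6

# central-difference derivative of the CDF at x
(pnorm(x + h) - pnorm(x - h)) / (2 * h)

# the PDF at x: should match the derivative above
dnorm(x)
```

The two numbers agree to several decimal places, as the definition says they should.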

Expected Values

expected value \(\mathbb{E}(X)\): typical or average value for the random variable

  • discrete: \(\mathbb{E}(X) = \sum_i x_i f_X(x_i) = \sum_i x_i P(X = x_i)\)
  • continuous: \(\mathbb{E}(X) = \int_{-\infty}^{\infty}x f_X(x) \, dx\)

Expected Values

\[\mathbb{E}(X) = \sum_i\color{#E69F00}{\underbrace{x_i}_\text{value of X}}\color{#56B4E9}{\overbrace{f_X(x_i)}^\text{probability of that value}}\]

The expected value is the sum of the possible values weighted by the probability of that value.
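For example, for one roll of a fair die:

```r
x   <- 1:6          # possible values: the faces of a fair die
p_x <- rep(1/6, 6)  # probability of each face

sum(x * p_x)        # expected value: 3.5
```

Note that the expected value (3.5) is not itself a possible outcome; it’s the probability-weighted average of the outcomes.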

Moments + Moment Generating Functions

Moments of a distribution are expectations.

\[ \mu'_n = \mathbb{E}X^n \]

Central Moments replace \(X\) with the mean centered value \(X-\mu\).

\[ \mu_n = \mathbb{E}(X-\mu)^n \]

Important Moments

  • mean: 1st moment

  • variance: 2nd central moment

  • skew: 3rd (standardized) central moment

  • kurtosis: 4th (standardized) central moment

Mean + Variance of Transformations

  • \(\mathbb{E}(aX + b) = a\,\mathbb{E}(X) + b\)

  • \(\mathbb{E}(X + Y) = \mathbb{E}(X) + \mathbb{E}(Y)\)

  • \(Var(aX + b) = a^2Var(X)\)

  • \(Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y)\)

  • \(Var(X - Y) = Var(X) + Var(Y) - 2Cov(X,Y)\)

Mean + Variance of Transformations

Test yourself!

  • \(\mathbb{E}(X + 2Y); X \sim N(0,10); Y \sim N(2,2)\)

  • \(Var(X + 2Y); X \sim N(0,10); Y \sim N(2,2)\)

Mean + Variance of Transformations: Peek into the Future

Don’t calculate, simulate!

# X ~ N(0,1)
X <- rnorm(1e5,0,1) # n,mu,sd
Z <- X*4.5 + 10
W <- X^2
print(paste0("Z: ", round(mean(Z),2), ", W: ", round(mean(W),2)))
[1] "Z: 9.99, W: 1.01"

Fantastic Distributions and Where to Find Them

Normal

  • \(\mu\): mean/median/center

  • \(\sigma\): standard deviation

Student t

  • \(\nu\): degrees of freedom (smaller = higher kurtosis)

Note: when \(\nu > 30\), the Student t distribution is quite close to \(N(0,1)\)

Cauchy

  • \(location\): center of distribution

  • \(scale\): width of distribution

Note: Cauchy distributions have undefined mean and variance. A Cauchy is the same as a Student t distribution with \(\nu = 1\)

Logistic

  • \(\mu\): mean of distribution

  • \(s\): scale of distribution

Note: you might be more familiar with the CDF of the logistic distribution: \(\frac{1}{1 + e^{-x}}\), from logistic regression
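Base R’s plogis is exactly this CDF; a quick check at an arbitrary point, \(x = 0.7\):

```r
x <- 0.7
plogis(x)          # built-in standard logistic CDF
1 / (1 + exp(-x))  # the formula above: same value
```

This is the inverse of the logit link used in logistic regression.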

Beta

  • \(\alpha\): number of successes

  • \(\beta\): number of failures

Note: you can also parameterize the beta distribution with \(\mu = \frac{\alpha}{\alpha + \beta}\) and \(\sigma^2 = \frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}\) which is easier in some cases
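Solving that parameterization for \(\alpha\) and \(\beta\) gives a handy converter. A sketch — beta_params is a made-up helper name, and the conversion is only valid when \(\sigma^2 < \mu(1-\mu)\):

```r
# solve mu = a/(a+b) and sigma2 = ab/((a+b)^2 (a+b+1)) for (a, b)
beta_params <- function(mu, sigma2) {
  nu <- mu * (1 - mu) / sigma2 - 1   # nu = alpha + beta
  c(alpha = mu * nu, beta = (1 - mu) * nu)
}

beta_params(0.5, 1/12)  # Beta(1, 1), i.e. uniform on [0, 1]
```

Sanity check: a mean of 0.5 with variance 1/12 is exactly the uniform distribution, which is Beta(1, 1).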

Binomial

  • \(n\): number of trials

  • \(p\): probability of success

Note: a Bernoulli distribution is just a binomial where \(n = 1\)

Dirichlet

  • \(\alpha\): a vector of concentration parameters, one per outcome; draws from the distribution are simplexes of probabilities over the outcomes

Note: Dirichlet is a generalization of the Beta distribution when there are more than 2 possible outcomes. Thank you to Andrew Heiss for a version of this plot code.

Poisson

  • \(\lambda\): the rate, the expected number of counts

Note: mean = variance = \(\lambda\). Sometimes counts have more variance than this, which is called overdispersion; in that case, a negative binomial distribution is more appropriate.
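A quick simulation check that mean = variance for a Poisson (with an arbitrary \(\lambda = 4\)):

```r
set.seed(8)
counts <- rpois(1e5, lambda = 4)  # simulate Poisson counts

mean(counts)  # ~4
var(counts)   # also ~4: mean = variance for a Poisson
```

If your real count data had a variance much larger than its mean, that would be overdispersion.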

Gamma

  • \(k\): shape

  • \(\theta\): scale

Graph Theory

Graphs

KNN Graph

Instagram

❓ How would you identify an “influencer” in a graph?

Dihyat, M. M., Malik, K., Khan, M. A., & Imran, B. (2021). Detecting ideal Instagram influencer using social network analysis. arXiv preprint arXiv:2107.05731.

from: https://oracleofbacon.org/graph.php

Nodes/Vertices

Edges

Path

Six Degrees of Kevin Bacon

from: https://oracleofbacon.org/graph.php

Cycles

Requirements:

  • start/end at same node

  • do not pass through the same node more than once

Directed Acyclic Graphs (Preview)

  • Directed: edges have a direction from one node to another

  • Acyclic: there are no cycles

  • Graph: …

Directed Acyclic Graphs (Preview)

❓ Why would cycles be a problem in a causal DAG?

from: https://health.ucdavis.edu/media-resources/ctsc/documents/pdfs/directed-acyclic-graphs20220209.pdf

Cliques

Clique: subsets of nodes where each pair of nodes in the clique are connected by an edge.



❓ On a show like Love Island, where only straight couples are allowed, what is the maximum clique size of a graph with nodes = contestants and edges = couplings? Does it change based on the number of contestants?

Density

Density: how many edges exist, compared to the number of possible edges. How “connected” the graph is.

\[ \rho = \frac{2m}{\underbrace{n(n-1)}_\text{possible edges}} \]

where \(m\) is the number of edges, and \(n\) is the number of nodes.
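As a sketch (graph_density is a made-up helper; packages like igraph ship their own density functions):

```r
# density of an undirected graph from edge count m and node count n
graph_density <- function(m, n) 2 * m / (n * (n - 1))

graph_density(m = 3, n = 3)  # a triangle: every possible edge exists -> 1
graph_density(m = 3, n = 4)  # 3 of the 6 possible edges -> 0.5
```

A density of 1 means the graph is complete; values near 0 mean it is sparse.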

Density

In DBSCAN, we make a graph where data points are nodes and edges exist between neighbors. We’re looking for sub-graphs that are densely connected.

minPts influences how dense the sub-graphs need to be, and eps determines which edges we’ll add to the graph as a whole.

Information Theory

Entropy

\[ H(p) = - \sum_{i=1}^N p(x_i) \times log(p(x_i)) \]

A measure of disorder, uncertainty, surprise or chaos.

Entropy


Information in an event: \(-log(p(x))\), so \(H(p)\) gives you the expected information.

❓ When do we get the most information?

Entropy

Entropy for a fair coin toss (using \(log_{10}\)):

\[ -(0.5 * log(0.5) + 0.5 * log(0.5)) \approx 0.3 \]

Entropy for an unfair coin toss:

\[-(0.99 * log(0.99) + 0.01 * log(0.01)) \approx 0.02 \]
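A minimal sketch reproducing these numbers (note they assume base-10 logs):

```r
# entropy with base-10 logs, matching the numbers above
entropy <- function(p) -sum(p * log10(p))

round(entropy(c(0.5, 0.5)), 2)    # fair coin: 0.3
round(entropy(c(0.99, 0.01)), 2)  # unfair coin: 0.02
```

The fair coin has the highest entropy: we are maximally uncertain about the outcome.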

Cross Entropy

\[ H(p,q) = - \sum_{i=1}^N p(x_i) \times log(q(x_i)) \]

If we think \(q(x)\) is the probability distribution, but \(p(x)\) is the real probability distribution, how surprised will we be?

Cross Entropy

\[ H(p,q) = - \left (\frac{1}{8} * log_2 \left (\frac{7}{8} \right) + \frac{7}{8} * log_2 \left (\frac{1}{8} \right) \right) \approx 2.6 \]

If we think \(q(x)\) is the probability distribution, but \(p(x)\) is the real probability distribution, how surprised will we be? Here \(p = \left(\frac{1}{8}, \frac{7}{8}\right)\) and \(q = \left(\frac{7}{8}, \frac{1}{8}\right)\): our guess is badly wrong, so we are very surprised.

Cross Entropy

Example: cross-entropy loss

cross_entropy <- function(p,q){
  -1 * sum(p * log(q, base = 2)) 
}
pred_probs <- c(0.2, 0.5, 0.3) # ML prediction
actual     <- c(0,1,0)

cross_entropy(actual, pred_probs)
[1] 1

Cross Entropy

Example: cross-entropy loss

cross_entropy <- function(p,q){
  -1 * sum(p * log(q, base = 2)) 
}
pred_probs <- c(0.005,0.99,0.005) # ML prediction
actual     <- c(0,1,0)

cross_entropy(actual, pred_probs)
[1] 0.01449957

KL Divergence

\[D_{KL} \left (p||q \right ) = \underbrace{H(p,q)}_\text{cross entropy} - \underbrace{H(p)}_\text{entropy}\]

A measure of how different distribution \(q\) is from distribution \(p\) (not a true distance: it is not symmetric).


How much information do you lose if you approximate \(p\) with \(q\)?
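A sketch using the coin-flip distributions from the cross-entropy example (kl_div is a made-up helper name):

```r
# D_KL(p || q) = sum p * log2(p / q), in bits
kl_div <- function(p, q) sum(p * log2(p / q))

p <- c(1/8, 7/8)  # the real distribution
q <- c(7/8, 1/8)  # our (bad) approximation

cross_ent <- -sum(p * log2(q))  # H(p, q)
ent       <- -sum(p * log2(p))  # H(p)

kl_div(p, q)                    # equals cross_ent - ent
```

The direct formula and the cross-entropy-minus-entropy decomposition agree, as the equation above says they must.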